Fix concurrency/scaling when many Python threads do streaming using *sync* completions #14816
Problem
I have 14 Python threads doing streaming LLM requests concurrently.
Expectation: this should be a mostly network-IO-bound workload, so it should scale reasonably well with the number of threads.
Reality: performance is practically serial, i.e., the whole run finishes only a little faster than L * N / nCPU, where L is the latency of one such request (streaming or not), N is the number of requests, and nCPU is the number of (v)CPUs available to the Python process.
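To make the expectation concrete, here is a minimal sketch of what an IO-bound threaded workload should look like. It does not call LiteLLM at all: `fake_streaming_request` is a hypothetical stand-in that uses `time.sleep` (which releases the GIL, much like a blocking socket read), and the latency and chunk count are made-up values.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY_S = 0.05        # stand-in for L, the latency of one request (made-up value)
N_REQUESTS = 14         # N, matching the 14 threads in the report
CHUNKS_PER_REQUEST = 5  # made-up number of streamed chunks per request


def fake_streaming_request(request_id: int) -> int:
    """Hypothetical stand-in for one streaming request.

    Each "chunk" is simulated by a sleep, which releases the GIL
    the same way a blocking network read would.
    """
    chunks = 0
    for _ in range(CHUNKS_PER_REQUEST):
        time.sleep(LATENCY_S / CHUNKS_PER_REQUEST)
        chunks += 1
    return chunks


def run_concurrently(n_requests: int, n_threads: int) -> tuple[float, list[int]]:
    """Run n_requests fake requests on n_threads threads; return (seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(fake_streaming_request, range(n_requests)))
    return time.perf_counter() - start, results


elapsed, results = run_concurrently(N_REQUESTS, N_REQUESTS)
serial_estimate = LATENCY_S * N_REQUESTS
print(f"concurrent: {elapsed:.3f}s, fully serial estimate: {serial_estimate:.3f}s")
```

With one thread per request, the wall time of this sketch should land close to L rather than L * N. The report above describes real streaming requests behaving more like the serial estimate, which points at CPU work or lock contention in the per-chunk path rather than at the network.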
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR
- `tests/litellm/` directory, adding at least 1 test is a hard requirement - see details
- `make test-unit`
Type
🆕 New Feature
🐛 Bug Fix
Changes
- Guard `executor.submit()` with `if not litellm.disable_streaming_logging` in the hot path in `streaming_handler.py`'s `__next__()`. This is a no-brainer change: `run_success_logging_and_cache_storage()` is exactly a no-op if `litellm.disable_streaming_logging` is True, so submitting a no-op to an executor doesn't make any sense.
- Update the `httpcore` dependency to pick up encode/httpcore#1038, "Don't hold lock unless necessary in PoolByteStream.close()".
- Make the sync transport configurable via `litellm.sync_transport`. I could have avoided this by changing `client` altogether, but this is a quality of life change.

Then, in my code when I use litellm, I pre-configure the HTTPTransport in the following way: